BACKGROUND

1. Technical Field 

The present invention generally relates to separating signal sources and, in 
particular, to online blind separation of multiple sources. 

2. Background Description 

The separation of independent sources from an array of sensors is a classic but 
difficult problem in signal processing. Generally, the signal sources as well as their 
mixture characteristics are unknown. Without knowledge of the signal sources, other than 
a general assumption that the sources are independent, the signal processing is commonly 
known in the art as the "blind separation of sources". The separation is "blind" because 
nothing is assumed about the independent source signals, nor about the mixing process. 

A typical example of the blind separation of source signals is where the source 
signals are sounds generated by two independent sources, such as two (or more) separate 
speakers. An equal number of microphones (two in this example) are used to produce 
mixed signals, each composed as a weighted sum of the source signals. Each of the 
source signals is delayed and attenuated in some unknown amount during passage from 
the speaker to a microphone, where it is mixed with the delayed and attenuated
components of the other source signals. Multi-path signals, generated by multiple
reflections of the source signals, are further mixed with direct source signals. This is 
generally known as the "cocktail party" problem, since a person generally wishes to listen 
to a single sound source while filtering out other interfering sources, including multi-path 
signals. 

According to the prior art, Jourjine et al. describe a blind source separation technique
that allows the separation of an arbitrary number of sources from just two mixtures,
provided the time-frequency representations of the sources do not overlap, in
"Blind Separation of Disjoint Orthogonal Signals: Demixing N Sources from 2
Mixtures", in Proceedings of the 2000 IEEE International Conference on Acoustics,
Speech, and Signal Processing, Istanbul, Turkey, June 2000, vol. 5, pp. 2985-88.
This technique is hereinafter referred to as the "original DUET algorithm". The
key observation in the technique is that, for mixtures of such sources, each time- 
frequency point depends on at most one source and its associated mixing parameters. In 
anechoic environments, it is possible to extract the estimates of the mixing parameters 
from the ratio of the time-frequency representations of the mixtures. These estimates 
cluster around the true mixing parameters and, identifying the clusters, one can partition 
the time-frequency representation of the mixtures to provide the time-frequency 
representations of the original sources. 

The original DUET algorithm involved creating a two-dimensional (weighted) 
histogram of the relative amplitude and delay estimates, finding the peaks in the 
histogram, and then associating each time-frequency point in the mixture with one peak.
The original implementation of the method was offline and passed through the data twice:
once to create the histogram and a second time to demix.

Accordingly, it would be desirable and highly advantageous to have an online 
method for performing blind source separation of multiple sources. Moreover, it would 
be further desirable and highly advantageous to have such a method that does not require 
the creation and updating of a histogram or the locating of peaks in the histogram. 

SUMMARY OF THE INVENTION 

The problems stated above, as well as other related problems of the prior art, are 
solved by the present invention, a method for online blind separation of multiple sources. 

The present invention provides an online version of the DUET algorithm that 
avoids the need for the creation of the histogram, which in turn avoids the computational 
load of updating the histogram and the tricky issue of finding and tracking peaks. The
advantages of the present invention over the prior art, in particular the original DUET
algorithm, include: online operation (5 times faster than real time); 15 dB average
separation for anechoic mixtures; 5 dB average separation for echoic mixtures; and the
ability to demix two or more sources from 2 mixtures.

According to an aspect of the present invention, there is provided a method for 
blind source separation of multiple sources. The multiple sources are detected using an 
array of sensors to obtain data representative of the multiple sources. The data is 
represented by two mixtures having estimates of amplitude and delay mixing parameters. 
The estimates of amplitude and delay mixing parameters are updated, comprising the 
steps of: calculating a plurality of error measures, each of the plurality of error measures
indicating a closeness of the estimates of amplitude and delay mixing parameters for a
given source to a given time-frequency point in the two mixtures; and revising the 
estimates of amplitude and delay mixing parameters, based on the plurality of error 
measures. The two mixtures are filtered to obtain estimates of the multiple sources, 
comprising the steps of: selecting one of the plurality of error measures having a smallest 
value in relation to any other of the plurality of error measures, for each of a plurality of 
time-frequency points in the mixtures; and leaving unaltered any of the plurality of time- 
frequency points in the mixtures for which a given one of the plurality of error measures 
has the smallest value, while setting to zero any other of the plurality of time-frequency 
points in the mixtures for which the given one of the plurality of error measures does not 
have the smallest value, for each of the plurality of error measures. The estimates of the 
multiple sources are output. 

These and other aspects, features and advantages of the present invention will 
become apparent from the following detailed description of preferred embodiments, 
which is to be read in connection with the accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a computer processing system 100 to which the 
present invention may be applied according to an illustrative embodiment thereof; 

FIG. 2 is a flow diagram illustrating a method for blind source separation of 
multiple sources, according to an illustrative embodiment of the present invention; 

FIG. 3 is a flow diagram illustrating step 210 of the method of FIG. 2, according 
to an illustrative embodiment of the present invention; 

FIG. 4 is a flow diagram illustrating step 240 of the method of FIG. 2, according
to an illustrative embodiment of the present invention; 

FIG. 5 is a flow diagram illustrating step 250 of the method of FIG. 2, according 
to an illustrative embodiment of the present invention; 

FIG. 6 is a flow diagram illustrating step 270 of the method of FIG. 2, according 
to an illustrative embodiment of the present invention; 

FIG. 7 is a diagram illustrating a test setup for blind source separation on anechoic 
data, according to an illustrative embodiment of the present invention; 

FIG. 8 is a diagram illustrating a comparison of overall separation SNR gain by 
angle difference for the anechoic data, according to an illustrative embodiment of the 
present invention; 

FIG. 9 is a diagram illustrating the overall separation SNR gain by 30 degree 
angle pairing for the anechoic data, according to an illustrative embodiment of the present 
invention; 

FIG. 10 is a diagram illustrating a comparison of overall separation SNR gain by 
angle difference, using echoic office data in a voice versus noise comparison, according 
to an illustrative embodiment of the present invention; 

FIG. 11 is a diagram illustrating separation results for pairwise mixtures of voices,
according to an illustrative embodiment of the present invention; and 

FIG. 12 is a diagram illustrating W-disjoint orthogonality for various sources, 
according to an illustrative embodiment of the present invention. 



DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 

The present invention is directed to online blind separation of multiple sources. It 
is to be understood that the present invention may be implemented in various forms of 
hardware, software, firmware, special purpose processors, or a combination thereof.
Preferably, the present invention is implemented as a combination of hardware and 
software. Moreover, the software is preferably implemented as an application program 
tangibly embodied on a program storage device. The application program may be 
uploaded to, and executed by, a machine comprising any suitable architecture. 
Preferably, the machine is implemented on a computer platform having hardware such as 
one or more central processing units (CPU), a random access memory (RAM), and 
input/output (I/O) interface(s). The computer platform also includes an operating system 
and microinstruction code. The various processes and functions described herein may 
either be part of the microinstruction code or part of the application program (or a 
combination thereof) that is executed via the operating system. In addition, various other 
peripheral devices may be connected to the computer platform such as an additional data 
storage device and a printing device. 

It is to be further understood that, because some of the constituent system 
components and method steps depicted in the accompanying Figures are preferably 
implemented in software, the actual connections between the system components (or the 
process steps) may differ depending upon the manner in which the present invention is 
programmed. Given the teachings herein, one of ordinary skill in the related art will be 
able to contemplate these and similar implementations or configurations of the present 
invention. 



FIG. 1 is a block diagram of a computer processing system 100 to which the 
present invention may be applied according to an illustrative embodiment thereof. The 
computer processing system 100 includes at least one processor (CPU) 102 operatively 
coupled to other components via a system bus 104. A read only memory (ROM) 106, a 
random access memory (RAM) 108, a display adapter 110, an I/O adapter 112, and a user
interface adapter 114 are operatively coupled to the system bus 104.

A display device 116 is operatively coupled to the system bus 104 by the display
adapter 110. A disk storage device (e.g., a magnetic or optical disk storage device) 118 is
operatively coupled to the system bus 104 by the I/O adapter 112.

A mouse 120 and keyboard 122 are operatively coupled to the system bus 104 by
the user interface adapter 114. The mouse 120 and keyboard 122 may be used to
input/output information to/from the computer processing system 100. 

The present invention will now be described generally with respect to FIGs. 2-6. 
Subsequent thereto, more detailed descriptions of various aspects of the present invention 
are provided. It is to be noted that any equations provided with respect to FIGs. 2-6
may be provided out of order; however, in the subsequent detailed description, the
equations and corresponding text are provided sequentially in order of performance
according to a preferred embodiment of the present invention. Given the teachings of the
present invention provided herein, one of ordinary skill in the related art will readily 
contemplate variations of the sequence and actual details of the equations and 
corresponding steps described herein while maintaining the spirit and scope of the present 
invention. 

FIG. 2 is a flow diagram illustrating a method for blind source separation of
multiple sources, according to an illustrative embodiment of the present invention. The 
multiple sources are detected using an array of sensors to obtain two mixtures 
corresponding to the multiple sources (step 210). A frequency domain representation is 
computed for the two mixtures (step 220). 

Subsequent to step 220, the method proceeds to step 240 to update mixing 
parameters of the two mixtures, and then the method proceeds to step 250 and filters the 
two mixtures. 

Subsequent to step 250, a time domain representation is computed for the two 
mixtures (step 260). Estimates of the (original) multiple sources are output (step 270). 

It is to be appreciated that the steps directed to computing the frequency domain and
time domain representations (steps 220 and 260, respectively) are readily performed
by one of ordinary skill in the related art. Nonetheless, for further detail on such
computations, the reader is referred to Deller et al., "Discrete-Time Processing of Speech
Signals", IEEE Press, 2000. The other steps are described in further detail below
with respect to FIGs. 3-6. 

FIG. 3 is a flow diagram illustrating step 210 (obtain mixtures) of the method of 
FIG. 2, according to an illustrative embodiment of the present invention. The multiple 
sources to be separated are mixed to obtain two mixtures $x_1$, $x_2$ (step 310), expressed as
follows:

$$x_1(t) = \sum_{j=1}^{N} s_j(t) \tag{1}$$

$$x_2(t) = \sum_{j=1}^{N} a_j\, s_j(t - \delta_j) \tag{2}$$

where $N$ is a number of the multiple sources; $\delta_j$ is an arrival delay between the array of
sensors resulting from an angle of arrival; $a_j$ is a relative attenuation factor corresponding
to a ratio of attenuations of paths between the multiple sources and the array of sensors;
$s_j(t)$ is a $j$-th source; $j$ is a source index ranging from 1 to $N$; and $t$ is a time
argument. We use $\Delta$ to denote the maximal possible
delay between sensors, and thus, $|\delta_j| \le \Delta$, $\forall j$.
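
The mixing model of Equations (1) and (2) can be illustrated with a minimal sketch. The function name `mix_anechoic` and the restriction to non-negative integer sample delays are assumptions of this example only; the model itself places no such restriction on the delays.

```python
import numpy as np

def mix_anechoic(sources, attenuations, delays):
    """Build x1(t) = sum_j s_j(t) and x2(t) = sum_j a_j * s_j(t - d_j).

    sources: array of shape (N, T), one row per source.
    attenuations: length-N relative attenuation factors a_j.
    delays: length-N non-negative integer sample delays d_j
            (a simplification made for this sketch).
    """
    sources = np.asarray(sources, dtype=float)
    n_samples = sources.shape[1]
    x1 = sources.sum(axis=0)
    x2 = np.zeros(n_samples)
    for s, a, d in zip(sources, attenuations, delays):
        delayed = np.zeros(n_samples)
        delayed[d:] = s[:n_samples - d]  # shift right by d samples, zero-fill
        x2 += a * delayed
    return x1, x2
```

For two short sources, `mix_anechoic([[1, 2, 3, 4], [10, 20, 30, 40]], [0.5, 2.0], [1, 0])` returns the plain sum as the first mixture and the attenuated, delayed sum as the second.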

It is to be appreciated that an "array of sensors" is referred to herein with respect 
to detecting multiple sources. The array of sensors may include 2 (a pair) or more 
sensors. 

FIG. 4 is a flow diagram illustrating step 240 (update mixing parameters) of the 
method of FIG. 2, according to an illustrative embodiment of the present invention. In 
general, the method of FIG. 4 updates the amplitude and delay estimates so that they 
better explain the measured data. The input information to the method is the time- 
frequency representations of the mixtures. 

Upon receiving the $k$-th block of data, corresponding to the frequency domain
representation of a window of data centered at $\tau_k = k\tau_\Delta$, where $\tau_\Delta$ is the time
separating adjacent windows of the first mixture $x_1$ and the second mixture $x_2$,
$\rho(a_j, \delta_j, \omega, \tau_k)$, which represents a distance measure of how well the current
guesses of the mixing parameters of the multiple sources explain the current data, is
computed for each $j = 1, \ldots, N$ as follows (step 410):

$$\rho(a_j, \delta_j, \omega, \tau_k) = \frac{1}{1 + a_j^2} \left| X_1(\omega, \tau_k)\, a_j e^{-i\omega\delta_j} - X_2(\omega, \tau_k) \right|^2 \tag{10}$$

where $a_j$ and $\delta_j$ are current estimates of amplitude and delay mixing parameters,
respectively; $X_1(\omega, \tau_k)$ and $X_2(\omega, \tau_k)$ are time-frequency representations of a first mixture
and a second mixture of the two mixtures, respectively; $k$ is a current time index; $\tau_k$ is a
time argument corresponding to a $k$-th time index; $\omega$ is a frequency argument; and $j$ is a
source index ranging from 1 to $N$, where $N$ is a number of the multiple sources.
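
As a hedged illustration of the error measure of Equation (10), the following sketch evaluates it for all $N$ parameter guesses over one frame of frequency bins. The function name `error_measure` and the array layout are assumptions of this example.

```python
import numpy as np

def error_measure(a, delta, omega, X1, X2):
    """rho_j(omega) = |a_j * exp(-i*omega*delta_j) * X1 - X2|^2 / (1 + a_j^2).

    a, delta: arrays of shape (N,), the current amplitude and delay guesses.
    omega, X1, X2: arrays of shape (F,), one frame of the two mixtures.
    Returns an array of shape (N, F).
    """
    a = np.asarray(a, dtype=float)[:, None]
    delta = np.asarray(delta, dtype=float)[:, None]
    predicted = a * np.exp(-1j * omega[None, :] * delta) * X1[None, :]
    return np.abs(predicted - X2[None, :]) ** 2 / (1.0 + a ** 2)
```

When a frame is generated by a single source whose true parameters equal one of the guesses, the corresponding row of the result is zero and the other rows are positive.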

It is to be appreciated that $\rho(a_j, \delta_j, \omega, \tau_k)$ is an error measure that calculates how
much a given time-frequency point in the mixtures is explained by a particular guess of
the $j$-th source's mixing parameters. The closer that $\rho(a_j, \delta_j, \omega, \tau_k)$ is to zero, the better
the guess explains the time-frequency content of the mixtures, and the more likely the
particular guess is the correct guess.

It is to be noted that each $\rho(a_j, \delta_j, \omega, \tau_k)$ term (wherein $j = 1, \ldots, N$) depends on $a_j$
and $\delta_j$, the current estimates of the relative amplitude and delay parameters, respectively.
Moreover, each $\rho(a_j, \delta_j, \omega, \tau_k)$ term measures a distance assessing how well the data
matches the $j$-th mixing parameter estimate. The smaller the distance, the better the $j$-th pair
of mixing parameters explains the corresponding time-frequency point in the mixtures.
The $\rho(a_j, \delta_j, \omega, \tau_k)$ terms play an important role in the updating of the amplitude and
delay estimates, both of which are computed in two steps. First, we estimate by how much and
in which direction we should change each estimate; this is the calculation in Equations
(19) and (18) for amplitude and delay, respectively. The better the data is explained by a
given amplitude-delay estimate, the larger the effect of that data on the resulting change
calculation. Second, the new amplitude and delay estimates are calculated via Equations
(20) and (21), respectively. The direction of the change has been calculated in Equations
(19) and (18), and the magnitude of the change is scaled by a learning rate constant $\beta$
and a variable learning rate calculated using Equations (22) and (23). The variable
learning rate is calculated such that estimates that are explaining more data than they
previously did have a higher learning rate, and estimates that are explaining less data than
they previously did have a lower learning rate. Estimates that explain roughly the same
amount of data over time (that is, not an increasing or decreasing amount) have roughly a
constant learning rate.

Upon computing $\rho(a_j, \delta_j, \omega, \tau_k)$, the following are computed: $\partial J(\tau_k)/\partial a_j$ (step 420),
$q_j[k]$ (step 430), and $\partial J(\tau_k)/\partial \delta_j$ (step 440).

$\partial J(\tau_k)/\partial a_j$, which represents the direction and magnitude of change in the current
estimate of the $j$-th source's amplitude mixing parameter causing the greatest change in
how well the amplitude estimate describes (corresponds to) the data, is computed as
follows (step 420):

$$\frac{\partial J(\tau_k)}{\partial a_j} = \sum_{\omega} \frac{e^{-\lambda\rho(a_j,\delta_j,\omega,\tau_k)}}{\sum_{l=1}^{N} e^{-\lambda\rho(a_l,\delta_l,\omega,\tau_k)}} \cdot \frac{2}{(1+a_j^2)^2} \Big( (a_j^2 - 1)\,\mathrm{Re}\big\{ X_1(\omega,\tau_k)\,\overline{X_2(\omega,\tau_k)}\, e^{-i\omega\delta_j} \big\} + a_j \big( |X_1(\omega,\tau_k)|^2 - |X_2(\omega,\tau_k)|^2 \big) \Big) \tag{19}$$

where $k$ is a current time index; $\tau_k$ is a time argument corresponding to a $k$-th time index; $a_j$
is the current estimate of amplitude mixing parameter for the $j$-th source; $\delta_j$ is a current
estimate of delay mixing parameter for the $j$-th source; $X_1(\omega,\tau_k)$ and $X_2(\omega,\tau_k)$ are
time-frequency representations of a first mixture and a second mixture of the two
mixtures, respectively; $\omega$ is a frequency argument; $j$ and $l$ are source indexes ranging from
1 to $N$, where $N$ is a number of the multiple sources; $\rho(a_j,\delta_j,\omega,\tau_k)$ is an error measure
for the $j$-th source; $\lambda$ is a smoothness parameter; the overbar denotes complex conjugation;
and Re is a function that returns a real part of a complex number.

$q_j[k]$, which represents the amount of mixture energy which is explained (defined)
by the $j$-th source's amplitude and delay mixing parameters, is computed as follows (step
430):

$$q_j[k] = \sum_{\omega} \frac{e^{-\lambda\rho(a_j,\delta_j,\omega,\tau_k)}}{\sum_{l=1}^{N} e^{-\lambda\rho(a_l,\delta_l,\omega,\tau_k)}}\, |X_1(\omega,\tau_k)|\,|X_2(\omega,\tau_k)| \tag{22}$$

where $k$ is a current time index; $\tau_k$ is a time argument corresponding to a $k$-th time index; $a_j$
and $\delta_j$ are current estimates of amplitude and delay mixing parameters, respectively, for
the $j$-th source; $X_1(\omega,\tau_k)$ and $X_2(\omega,\tau_k)$ are time-frequency representations of a first
mixture and a second mixture of the two mixtures, respectively; $\omega$ is a frequency
argument; $j$ and $l$ are source indexes ranging from 1 to $N$, where $N$ is a number of the
multiple sources; and $\lambda$ is a smoothness parameter.

$\partial J(\tau_k)/\partial \delta_j$, which represents the direction and magnitude of change in the current
estimate of the $j$-th source's delay parameter causing the greatest change in how well the
delay estimate describes (corresponds to) the data, is computed as follows (step 440):

$$\frac{\partial J(\tau_k)}{\partial \delta_j} = \sum_{\omega} \frac{e^{-\lambda\rho(a_j,\delta_j,\omega,\tau_k)}}{\sum_{l=1}^{N} e^{-\lambda\rho(a_l,\delta_l,\omega,\tau_k)}} \cdot \frac{-2\omega a_j}{1+a_j^2}\, \mathrm{Im}\big\{ X_1(\omega,\tau_k)\,\overline{X_2(\omega,\tau_k)}\, e^{-i\omega\delta_j} \big\} \tag{18}$$

where $k$ is a current time index; $\tau_k$ is a time argument corresponding to a $k$-th time index; $a_j$
is a current estimate of amplitude mixing parameter for the $j$-th source; $\delta_j$ is the current
estimate of delay mixing parameter for the $j$-th source; $X_1(\omega,\tau_k)$ and $X_2(\omega,\tau_k)$ are time-frequency
representations of a first mixture and a second mixture of the two mixtures,
respectively; $\omega$ is a frequency argument; $j$ and $l$ are source indexes ranging from 1 to $N$,
where $N$ is a number of the multiple sources; $\rho(a_j,\delta_j,\omega,\tau_k)$ is an error measure for the
$j$-th source; $\lambda$ is a smoothness parameter; the overbar denotes complex conjugation; and Im
is a function that returns an imaginary part of a complex number.

Subsequent to step 430, $\alpha_j[k]$, which represents the time dependent parameter
(variable) learning rate, is computed as follows (step 450):

$$\alpha_j[k] = \frac{q_j[k]}{\sum_{m=0}^{k} \gamma^{k-m}\, q_j[m]} \tag{23}$$

where $q_j[k]$ represents an amount of mixture energy that is defined by estimates of
amplitude and delay mixing parameters for a $j$-th source; $\gamma$ is a forgetting factor; $m$ is a
time index ranging from 0 to a current time index $k$; and $j$ is a source index ranging from
1 to $N$, where $N$ is a number of the multiple sources.



Upon computing $\partial J(\tau_k)/\partial a_j$ and $\alpha_j[k]$, mixing parameter estimate $a_j[k]$ is updated
as follows (step 460):

$$a_j[k] = a_j[k-1] - \beta\,\alpha_j[k]\,\frac{\partial J(\tau_k)}{\partial a_j} \tag{20}$$

Upon computing $\partial J(\tau_k)/\partial \delta_j$ and $\alpha_j[k]$, mixing parameter estimate $\delta_j[k]$ is updated
as follows (step 470):

$$\delta_j[k] = \delta_j[k-1] - \beta\,\alpha_j[k]\,\frac{\partial J(\tau_k)}{\partial \delta_j} \tag{21}$$

where, for steps 460 and 470, $a_j[k]$ and $\delta_j[k]$ are the estimates of amplitude and delay
mixing parameters for a $j$-th source at a time index $k$, respectively; $\beta$ is a learning rate
constant; $\partial J(\tau_k)/\partial a_j$ represents the magnitude and the direction of change in a current
estimate of amplitude mixing parameter for a $j$-th source that causes a largest change in
correspondence between the current estimate and the data; $\partial J(\tau_k)/\partial \delta_j$ represents the
magnitude and the direction of change in a current estimate of delay mixing parameter for
a $j$-th source that causes a largest change in correspondence between the current estimate
and the data; $\tau_k$ is a current time; $k$ is a current time index; and $j$ is a source index ranging
from 1 to $N$, wherein $N$ is a number of the multiple sources.
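
Steps 410 through 470 can be sketched together for a single frame as follows. This is an illustrative sketch only: the function name `duet_update`, the array layout, and the running-sum form of the variable learning rate (the current explained energy $q_j[k]$ divided by a sum of past values weighted by the forgetting factor) are assumptions of this example, and the gradient expressions are those obtained by differentiating the smooth objective with the error measure defined above.

```python
import numpy as np

def duet_update(a, delta, denom, omega, X1, X2, beta=0.02, gamma=0.95, lam=10.0):
    """One update of the N amplitude/delay estimates from a single frame.

    a, delta, denom: arrays of shape (N,). denom carries the running sum
    sum_m gamma**(k-m) * q_j[m] from earlier frames (zeros at the start).
    omega, X1, X2: arrays of shape (F,).
    """
    a_col, d_col = a[:, None], delta[:, None]
    phase = np.exp(-1j * omega[None, :] * d_col)
    rho = np.abs(a_col * phase * X1 - X2) ** 2 / (1.0 + a_col ** 2)

    # Soft assignment of each bin to each source: softmax of -lam * rho.
    w = np.exp(-lam * (rho - rho.min(axis=0)))
    w /= w.sum(axis=0)

    cross = X1 * np.conj(X2) * phase  # X1 * conj(X2) * exp(-i*omega*delta_j)
    p1, p2 = np.abs(X1) ** 2, np.abs(X2) ** 2
    grad_a = np.sum(
        w * 2.0 * ((a_col ** 2 - 1.0) * cross.real + a_col * (p1 - p2))
        / (1.0 + a_col ** 2) ** 2, axis=1)
    grad_d = np.sum(
        w * (-2.0 * omega * a_col / (1.0 + a_col ** 2)) * cross.imag, axis=1)

    # Explained energy and variable learning rate (assumed running-sum form).
    q = np.sum(w * np.abs(X1) * np.abs(X2), axis=1)
    denom = gamma * denom + q
    alpha = q / np.maximum(denom, 1e-12)

    return a - beta * alpha * grad_a, delta - beta * alpha * grad_d, denom
```

The caller keeps `denom` between frames, so the forgetting-factor sum is updated recursively rather than recomputed. With a single source and estimates already equal to the true mixing parameters, both gradients vanish and the estimates are returned unchanged.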

FIG. 5 is a flow diagram illustrating step 250 (filter) of the method of FIG. 2, 
according to an illustrative embodiment of the present invention. 

Using $\rho(a_j, \delta_j, \omega, \tau_k)$, time-frequency masks are computed for the estimation of
the $j$-th source as follows (step 510):

$$\Omega_j(\omega,\tau_k) = \begin{cases} 1 & \rho(a_j,\delta_j,\omega,\tau_k) \le \rho(a_m,\delta_m,\omega,\tau_k)\ \ \forall m \ne j \\ 0 & \text{otherwise} \end{cases} \tag{24}$$

where $\Omega_j(\omega,\tau_k)$ is a time-frequency mask; $\rho(a_j,\delta_j,\omega,\tau_k)$ is an error measure for a $j$-th
source; $j$ and $m$ are source indexes ranging from 1 to $N$, where $N$ is a number of the
multiple sources; $\tau_k$ is a current time; $k$ is a current time index; $a_j$ and $\delta_j$ are current
estimates of amplitude and delay mixing parameters for the $j$-th source, respectively; and $\omega$
is a frequency argument.

It is to be appreciated that, as noted above, the closer $\rho(a_j, \delta_j, \omega, \tau_k)$ is to zero, the
better it explains the time-frequency content of the mixtures, and the more likely the
particular guess is the correct guess. Thus, for each time-frequency point, one of the
guesses has to be the correct one, so we choose the smallest value of $\rho(a_j, \delta_j, \omega, \tau_k)$ as it
best explains the data; this is the selection done in Equation (24).

Using the time-frequency masks, the first mixture $x_1$ and the second mixture $x_2$ are
filtered to obtain estimates of the (original) multiple sources as follows (step 520):

$$\hat{S}_j(\omega,\tau_k) = \Omega_j(\omega,\tau_k)\,X_1(\omega,\tau_k) \tag{25}$$

where $\hat{S}_j(\omega,\tau_k)$ is an estimate of a time-frequency representation of a $j$-th source; $j$ is a
source index ranging from 1 to $N$, where $N$ is a number of the multiple sources; $\Omega_j(\omega,\tau_k)$
is a time-frequency mask; $X_1(\omega,\tau_k)$ is a time-frequency representation of a first mixture
of the two mixtures; $\omega$ is a frequency argument; $\tau_k$ is a current time; and $k$ is a current time
index.

Thus, it is to be appreciated that at step 520, we take all the time-frequency points
for which $\rho(a_j, \delta_j, \omega, \tau_k)$ was the best match (that is, where $\rho(a_j, \delta_j, \omega, \tau_k)$ has the
smallest value), and leave these time-frequency points unaltered while we set to zero (or
some low threshold) all the time-frequency points in the mixtures for which $\rho(a_j, \delta_j, \omega, \tau_k)$
was not the smallest. This time-frequency filtered version of the mixtures is the estimate
of the time-frequency representation of the $j$-th source. We repeat this for each $j =
1, \ldots, N$ to obtain the $N$ original source estimates. This is the filtering of Equation (25).
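
The masking and filtering of steps 510 and 520 can be sketched as follows. The function name `demix_frame` is an assumption of this example, and ties between error measures are broken in favor of the lowest source index, which is a simplification of this sketch.

```python
import numpy as np

def demix_frame(rho, X1):
    """Keep each bin of X1 for the source with the smallest error measure.

    rho: array of shape (N, F) of error measures for one frame.
    X1: array of shape (F,), the first mixture's frame.
    Returns an array of shape (N, F) of source estimates for the frame.
    """
    winner = np.argmin(rho, axis=0)  # best-matching source per bin
    masks = np.arange(rho.shape[0])[:, None] == winner[None, :]
    return masks * X1[None, :]
```

Because every bin is assigned to exactly one source in this sketch, the source estimates sum back to the first mixture.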

FIG. 6 is a flow diagram illustrating step 270 (output estimates) of the method of
FIG. 2, according to an illustrative embodiment of the present invention. 

A dual window function is applied to the estimates of the (original) multiple 
sources obtained at step 520 of FIG. 5 to reconstruct the multiple sources from the 
estimates (step 610). 

A description of mixing parameter estimation will now be given, according to an 
illustrative embodiment of the present invention. Such description will include 
descriptions of source mixing, source assumptions, amplitude-delay estimation, and ML 
mixing parameter gradient search. It is to be appreciated that, for the sake of brevity, 
definitions of terms appearing in the equations herein below will not be repeated; such 
definitions have been provided with respect to FIGs. 2-6 herein above or are readily 
ascertainable by one of ordinary skill in the related art. 

Accordingly, source mixing, which is associated herein with mixing parameter 
estimation, will first be described, according to an illustrative embodiment of the present 
invention. Consider measurements of a pair of sensors where only the direct path is 
present. In this case, without loss of generality, we can absorb the attenuation and delay 
parameters of the first mixture, $x_1(t)$, into the definition of the sources. The two mixtures
can thus be expressed as,

$$x_1(t) = \sum_{j=1}^{N} s_j(t), \tag{1}$$

$$x_2(t) = \sum_{j=1}^{N} a_j\, s_j(t - \delta_j), \tag{2}$$

where $N$ is the number of sources, $\delta_j$ is the arrival delay between the sensors
resulting from the angle of arrival, and $a_j$ is a relative attenuation factor corresponding to
the ratio of the attenuations of the paths between sources and sensors. We use $\Delta$ to
denote the maximal possible delay between sensors, and thus, $|\delta_j| \le \Delta$, $\forall j$.

A description of source assumptions, which are associated herein with mixing 
parameter estimation, will now be given according to an illustrative embodiment of the 
present invention. 

We call two functions $s_1(t)$ and $s_2(t)$ W-disjoint orthogonal if, for a given
windowing function $W(t)$, the supports of the windowed Fourier transforms of $s_1(t)$ and
$s_2(t)$ are disjoint. The windowed Fourier transform of $s_j(t)$ is defined as,

$$F^W(s_j(\cdot))(\omega,\tau) = \int_{-\infty}^{\infty} W(t-\tau)\,s_j(t)\,e^{-i\omega t}\,dt, \tag{3}$$

which we will refer to as $S_j(\omega,\tau)$ where appropriate. The W-disjoint orthogonality
assumption can be stated concisely,

$$S_1(\omega,\tau)\,S_2(\omega,\tau) = 0, \quad \forall \omega, \tau. \tag{4}$$

In Appendix A, we introduce the notion of approximate W-disjoint orthogonality.
When $W(t) = 1$, we use the following property of the Fourier transform,

$$F^W(s_j(\cdot - \delta))(\omega,\tau) = e^{-i\omega\delta}\,F^W(s_j(\cdot))(\omega,\tau). \tag{5}$$

We will assume that (5) holds for all $\delta$, $|\delta| \le \Delta$, even when $W(t)$ has finite support. This
assumption is further described by Balan et al., in "The Influence of Windowing on Time 
Delay Estimates", Proceedings of the 2000 CISS, Princeton, NJ, March 15-17, 2000. 

A description of amplitude-delay estimation, which is associated herein with 
mixing parameter estimation, will now be given according to an illustrative embodiment 
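of the present invention. Using the above assumptions, we can write the model from (1)
and (2) for the case with two array elements as,

$$\begin{bmatrix} X_1(\omega,\tau) \\ X_2(\omega,\tau) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} S_1(\omega,\tau) \\ \vdots \\ S_N(\omega,\tau) \end{bmatrix}. \tag{6}$$

For W-disjoint orthogonal sources, we note that at most one of the $N$ sources will
be non-zero for a given $(\omega,\tau)$, thus,

$$\begin{bmatrix} X_1(\omega,\tau) \\ X_2(\omega,\tau) \end{bmatrix} = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix} S_j(\omega,\tau), \quad \text{for some } j. \tag{7}$$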

The original DUET algorithm estimated the mixing parameters by analyzing the
ratio of $X_1(\omega,\tau)$ and $X_2(\omega,\tau)$. In light of (7), it is clear that mixing parameter estimates
can be obtained via,

$$\big(a(\omega,\tau),\, \delta(\omega,\tau)\big) = \left( \left| \frac{X_2(\omega,\tau)}{X_1(\omega,\tau)} \right|,\ \frac{1}{\omega}\,\mathrm{Im}\left( \log \frac{X_1(\omega,\tau)}{X_2(\omega,\tau)} \right) \right). \tag{8}$$
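
The ratio-based estimates of the original DUET algorithm can be sketched as follows, under the single-active-source model in which the second mixture equals the first scaled by $a_j e^{-i\omega\delta_j}$. The function name `ratio_estimates` is an assumption of this example, which also assumes non-zero frequencies, non-zero values of the first mixture, and delays small enough that the phase $\omega\delta$ does not wrap beyond $\pm\pi$.

```python
import numpy as np

def ratio_estimates(omega, X1, X2):
    """Per-bin amplitude and delay estimates from the ratio of two mixtures.

    With X2 = a * exp(-i*omega*delta) * X1, the magnitude of X2/X1 is a and
    the negated phase of X2/X1 divided by omega is delta.
    """
    ratio = X2 / X1
    a_hat = np.abs(ratio)
    delta_hat = -np.angle(ratio) / omega  # equals Im(log(X1/X2)) / omega
    return a_hat, delta_hat
```

For a frame generated by one source with attenuation 0.7 and delay 1.5, every bin returns approximately that pair, which is why such estimates cluster around the true mixing parameters.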



The original DUET algorithm constructed a 2-D histogram of amplitude-delay
estimates and looked at the number and location of the peaks in the histogram to 
determine the number of sources and their mixing parameters. The 2-D histogram is 
further described by Jourjine et al., in "Blind Separation of Disjoint Orthogonal Signals: 
Demixing N Sources from 2 Mixtures", in Proceedings of the 2000 IEEE International 
Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, June 2000, 
vol. 5, pp. 2985-88. 

A description of a maximum likelihood (ML) mixing parameter gradient search, 
which is associated herein with mixing parameter estimation, will now be given 
according to an illustrative embodiment of the present invention. For the online 
algorithm, we take a different approach. First note that, 



$$\left| X_1(\omega,\tau)\,a_j e^{-i\omega\delta_j} - X_2(\omega,\tau) \right|^2 = 0, \tag{9}$$

if source $j$ is the active source at time-frequency $(\omega,\tau)$. Moreover, defining,

$$\rho(a_j,\delta_j,\omega,\tau) = \frac{1}{1+a_j^2} \left| X_1(\omega,\tau)\,a_j e^{-i\omega\delta_j} - X_2(\omega,\tau) \right|^2 \tag{10}$$

we can see that,

$$\sum_{\omega} \min\big( \rho(a_1,\delta_1,\omega,\tau), \ldots, \rho(a_N,\delta_N,\omega,\tau) \big) = 0, \tag{11}$$

because at least one $\rho(a_j,\delta_j,\omega,\tau)$ will be zero at each frequency. In the Appendix, it is
shown that the maximum likelihood estimates of the mixing parameters satisfy,

$$\min_{a_1,\delta_1,\ldots,a_N,\delta_N} \sum_{\omega} \min\big( \rho(a_1,\delta_1,\omega,\tau), \ldots, \rho(a_N,\delta_N,\omega,\tau) \big). \tag{12}$$



We perform gradient descent with (12) as the objective function to learn the mixing 
parameters. In order to avoid the discontinuous nature of the minimum function, we 
approximate it smoothly as follows, 



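$$\min(\rho_1, \rho_2) = \frac{\rho_1 + \rho_2 - |\rho_1 - \rho_2|}{2} \tag{13}$$

$$\approx \frac{\rho_1 + \rho_2 - \phi(\rho_1 - \rho_2)}{2} \tag{14}$$

$$= -\frac{1}{\lambda} \ln\left( e^{-\lambda\rho_1} + e^{-\lambda\rho_2} \right) \tag{15}$$

where,

$$\phi(x) = \int^{x} \frac{1 - e^{-\lambda t}}{1 + e^{-\lambda t}}\,dt = x + \frac{2}{\lambda} \ln\left( 1 + e^{-\lambda x} \right) \tag{16}$$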
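
The smooth approximation of the minimum can be checked numerically with a short sketch. The function name `smooth_min` is an assumption of this example; the subtraction of the true minimum before exponentiating is a standard numerical safeguard and does not change the value.

```python
import numpy as np

def smooth_min(rho, lam=10.0):
    """-(1/lam) * ln(sum_j exp(-lam * rho_j)) along axis 0, computed stably."""
    rho = np.asarray(rho, dtype=float)
    m = rho.min(axis=0)
    return m - np.log(np.exp(-lam * (rho - m)).sum(axis=0)) / lam
```

The result never exceeds the true minimum and differs from it by at most $\ln(N)/\lambda$ for $N$ arguments, so a larger smoothness parameter gives a tighter approximation at the cost of a sharper transition.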



Generalizing (15), the smooth ML objective function is, 



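$$J(\tau) = \min_{a_1,\delta_1,\ldots,a_N,\delta_N} -\frac{1}{\lambda} \sum_{\omega} \ln\left( e^{-\lambda\rho(a_1,\delta_1,\omega,\tau)} + \cdots + e^{-\lambda\rho(a_N,\delta_N,\omega,\tau)} \right), \tag{17}$$

which has partials,

$$\frac{\partial J(\tau)}{\partial \delta_j} = \sum_{\omega} \frac{e^{-\lambda\rho(a_j,\delta_j,\omega,\tau)}}{\sum_{l=1}^{N} e^{-\lambda\rho(a_l,\delta_l,\omega,\tau)}} \cdot \frac{-2\omega a_j}{1+a_j^2}\, \mathrm{Im}\big\{ X_1(\omega,\tau)\,\overline{X_2(\omega,\tau)}\, e^{-i\omega\delta_j} \big\} \tag{18}$$

and,

$$\frac{\partial J(\tau)}{\partial a_j} = \sum_{\omega} \frac{e^{-\lambda\rho(a_j,\delta_j,\omega,\tau)}}{\sum_{l=1}^{N} e^{-\lambda\rho(a_l,\delta_l,\omega,\tau)}} \cdot \frac{2}{(1+a_j^2)^2} \Big( (a_j^2 - 1)\,\mathrm{Re}\big\{ X_1(\omega,\tau)\,\overline{X_2(\omega,\tau)}\, e^{-i\omega\delta_j} \big\} + a_j \big( |X_1(\omega,\tau)|^2 - |X_2(\omega,\tau)|^2 \big) \Big) \tag{19}$$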
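
The partials follow from the chain rule applied to the smooth objective. As a sketch, writing $\rho_j$ for the error measure of source $j$ at a fixed time-frequency point, with $C_j$ and $D_j$ (shorthand introduced here only for illustration) denoting the real and imaginary parts of the cross term:

```latex
\begin{aligned}
\rho_j &= \frac{a_j^2 |X_1|^2 + |X_2|^2 - 2 a_j C_j}{1 + a_j^2},
\qquad C_j + i D_j = X_1 \overline{X_2}\, e^{-i\omega\delta_j}, \\
\frac{\partial J}{\partial \rho_j} &= \frac{e^{-\lambda\rho_j}}{\sum_{l} e^{-\lambda\rho_l}}, \\
\frac{\partial \rho_j}{\partial \delta_j} &= \frac{-2\omega a_j}{1 + a_j^2}\, D_j, \\
\frac{\partial \rho_j}{\partial a_j} &= \frac{2}{(1 + a_j^2)^2}
  \Big( (a_j^2 - 1)\, C_j + a_j \big( |X_1|^2 - |X_2|^2 \big) \Big).
\end{aligned}
```

As a sanity check under these definitions, both partials of $\rho_j$ vanish when the second mixture equals $a_j e^{-i\omega\delta_j}$ times the first, since then $C_j = a_j |X_1|^2$, $D_j = 0$, and $|X_2|^2 = a_j^2 |X_1|^2$.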

We assume we know the number of sources we are searching for and initialize an 
amplitude and delay estimate pair to random values for each source. The estimates 
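$(a_j[k], \delta_j[k])$ for the current time $\tau_k = k\tau_\Delta$ (where $\tau_\Delta$ is the time separating adjacent time
windows) are updated based on the previous estimate and the current gradient as follows,

$$a_j[k] = a_j[k-1] - \beta\,\alpha_j[k]\,\frac{\partial J(\tau_k)}{\partial a_j} \tag{20}$$

$$\delta_j[k] = \delta_j[k-1] - \beta\,\alpha_j[k]\,\frac{\partial J(\tau_k)}{\partial \delta_j} \tag{21}$$

where $\beta$ is a learning rate constant and $\alpha_j[k]$ is a time and mixing parameter dependent
learning rate for time index $k$ for estimate $j$. In practice, we have found it helpful to
adjust the learning rate depending on the amount of mixture energy recently explained by
the current estimate. We define,

$$q_j[k] = \sum_{\omega} \frac{e^{-\lambda\rho(a_j[k],\delta_j[k],\omega,\tau_k)}}{\sum_{l=1}^{N} e^{-\lambda\rho(a_l[k],\delta_l[k],\omega,\tau_k)}}\, |X_1(\omega,\tau_k)|\,|X_2(\omega,\tau_k)| \tag{22}$$

and update the parameter dependent learning rate as follows,

$$\alpha_j[k] = \frac{q_j[k]}{\sum_{m=0}^{k} \gamma^{k-m}\, q_j[m]} \tag{23}$$

where $\gamma$ is a forgetting factor.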

A description of demixing will now be given, according to an illustrative
embodiment of the present invention. In order to demix the $j$-th source, we construct a
time-frequency mask based on the ML parameter estimator (see (B) in the Appendix),

$$\Omega_j(\omega,\tau) = \begin{cases} 1 & \rho(a_j,\delta_j,\omega,\tau) \le \rho(a_m,\delta_m,\omega,\tau)\ \ \forall m \ne j \\ 0 & \text{otherwise} \end{cases} \tag{24}$$

The estimate for the time-frequency representation of the $j$-th source is,

$$\hat{S}_j(\omega,\tau) = \Omega_j(\omega,\tau)\,X_1(\omega,\tau) \tag{25}$$



We then reconstruct the source using the appropriate dual window function. The 
preceding is further described by I. Daubechies, in "Ten Lectures on Wavelets", ch. 3, 
SIAM, Philadelphia, PA, 1992. In this way, we demix all the sources by partitioning the 
time-frequency representation of one of the mixtures. Note that because the method does 
not invert the mixing matrix, it can demix all sources even when the number of sources is 
greater than the number of mixtures (N > M). 
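
The masking step (24) and (25) can be sketched for a single frame as follows. This is a minimal illustration, assuming the mixing parameters have already been learned. The function names and the synthetic two-frequency frame are made up, and ties between estimates are resolved in favor of the lowest index, which is an assumption of this sketch.

```python
import cmath


def rho(a, delta, omega, x1, x2):
    """|a e^{-i omega delta} X1 - X2|^2 / (1 + a^2) for one time-frequency point."""
    return abs(a * cmath.exp(-1j * omega * delta) * x1 - x2) ** 2 / (1.0 + a * a)


def demix_frame(params, frame):
    """Partition X1 among the sources: estimates[j][i] = Omega_j * X1 at frequency i.

    params: list of (a_j, delta_j); frame: list of (omega, X1, X2).
    """
    estimates = [[0j] * len(frame) for _ in params]
    for i, (omega, x1, x2) in enumerate(frame):
        costs = [rho(a, d, omega, x1, x2) for a, d in params]
        j = costs.index(min(costs))  # ties go to the lowest index (assumption)
        estimates[j][i] = x1
    return estimates


# Synthetic frame: source 0 alone is active at the first frequency,
# source 1 alone at the second (perfectly W-disjoint orthogonal).
params = [(0.8, 0.5), (1.2, -0.3)]
(a0, d0), (a1, d1) = params
s_a, s_b = 2.0 + 1.0j, -1.0 + 0.5j
frame = [
    (1.0, s_a, a0 * cmath.exp(-1j * 1.0 * d0) * s_a),
    (2.0, s_b, a1 * cmath.exp(-1j * 2.0 * d1) * s_b),
]
estimates = demix_frame(params, frame)
print(estimates)
```

Only the first mixture is partitioned, so the number of estimate pairs in `params` is not limited by the number of mixtures.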

A description of tests performed with respect to an illustrative embodiment of the 
present invention will now be given. We tested the method on mixtures created in both 
an anechoic room and an echoic office environment. The algorithm used parameters 
$\beta = 0.02$, $\gamma = 0.95$, $\lambda = 10$ and a Hamming window of size 512 samples (with adjacent 
windows separated by 128 samples) in all the tests. For all tests, the method ran more 
than 5 times faster than real time. 

FIG. 7 is a diagram illustrating a test setup for blind source separation on anechoic 
data, according to an illustrative embodiment of the present invention. Microphones are 
separated by ~1.75 cm centered along the 180 degree to 0 degree line. The X's show the 
source locations used in the anechoic tests. The O's show the locations of the sources in 
the echoic tests. Separate recordings at 16 kHz were made of six speech files (4 female, 2 
male) taken from the TIMIT database played from a loudspeaker placed at the X marks in 
FIG. 7. Pairwise mixtures were then created from all possible voice/angle combinations, 
excluding same voice and same angle combinations, yielding a total of 630 mixtures (630 
= 6 x 5 x 7 x 6/2). 
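
The mixture count can be reproduced by direct enumeration. The sketch below assumes seven source angles spaced 30 degrees apart from 10 to 190 degrees, which is inferred from the angle pairs quoted in the discussion of FIG. 8. The voice labels are hypothetical placeholders.

```python
from itertools import combinations, permutations

voices = ["f1", "f2", "f3", "f4", "m1", "m2"]  # hypothetical labels: 4 female, 2 male
angles = [10, 40, 70, 100, 130, 160, 190]      # assumed anechoic source angles (degrees)

# Unordered angle pairs (7 x 6 / 2) times ordered voice pairs (6 x 5).
mixtures = [
    (v1, th1, v2, th2)
    for th1, th2 in combinations(angles, 2)
    for v1, v2 in permutations(voices, 2)
]
pairs_60 = [(t1, t2) for t1, t2 in combinations(angles, 2) if t2 - t1 == 60]
tests_10_40 = [m for m in mixtures if (m[1], m[3]) == (10, 40)]

print(len(mixtures), len(pairs_60), len(tests_10_40))
```

Under these assumptions each angle pairing is covered by 30 voice combinations, and five angle pairings have a 60 degree difference.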

The SNR gains of the demixtures were calculated as follows. Denote the 
contribution of source j on microphone k as $S_{jk}(\omega,\tau)$. Thus we have, 

$$X_1(\omega,\tau) = S_{11}(\omega,\tau) + S_{21}(\omega,\tau) \tag{26}$$

$$X_2(\omega,\tau) = S_{12}(\omega,\tau) + S_{22}(\omega,\tau) \tag{27}$$



As we do not know the permutation of the demixing, we calculate the SNR gain 
conservatively, 



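[The defining equations for $\mathrm{SNR}_1$ and $\mathrm{SNR}_2$ are not legible in the source text. The 
surviving fragments show $\mathrm{SNR}_1$ built from max terms and $\mathrm{SNR}_2$ from min terms, each 
term of the form 10 log of a ratio involving the masks $\Omega_1$, $\Omega_2$ and the contributions $S_{jk}$.]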

In order to give the method time to learn the mixing parameters, the SNR results do not 
include the first half second of data. 

FIG. 8 is a diagram illustrating the average SNR gain results for each angle 
difference for the anechoic data, according to an illustrative embodiment of the present 
invention. That is, FIG. 8 illustrates a comparison of overall separation SNR gain by 
angle difference, according to an illustrative embodiment of the present invention. As an 
example, the 60 degree difference results average all the 10-70, 40-100, 70-130, 100-160, 
and 130-190 results. Each bar shows the maximum SNR gain, one standard deviation 
above the mean, the mean (which is labeled), one standard deviation below the mean, and 
the minimum SNR gain over all the tests (both $\mathrm{SNR}_1$ and $\mathrm{SNR}_2$ are included in the 
averages). The separation results improve as the angle difference increases. FIG. 9 is a 
diagram illustrating the 30 degree difference results by angle comparison for the anechoic 
data, averaging 30 tests per angle comparison, according to an illustrative embodiment of 
the present invention. That is, FIG. 9 illustrates the overall separation SNR gain by 30 
degree angle pairing, according to an illustrative embodiment of the present invention. 
The performance is a function of the delay. That is, the worst performance is achieved 
for the smallest delay (corresponding to the 10-40 mixtures), and so forth. 

Recordings were also made in an echoic office with a reverberation time of 
~500 ms, that is, the impulse response of the room fell to -60 dB after 500 ms. For the 
echoic tests, the sources were placed at 0, 90, 120, 150, and 180 degrees (see the O's in 
FIG. 7). FIG. 11 is a diagram illustrating separation results for pairwise mixtures of 
voices (4 female, 4 male), according to an illustrative embodiment of the present 
invention. Separation results for pairwise mixtures of voices (4 female, 4 male) and 
noises (line printer, copy machine, and vacuum cleaner) are shown in FIG. 10. That is, 
FIG. 10 is a diagram illustrating a comparison of overall separation SNR gain by angle 
difference, using echoic office data in a voice versus noise comparison, according to an 
illustrative embodiment of the present invention. The results are considerably worse in 
the echoic case, which is not surprising as the method assumes anechoic mixing. 
However, the method does achieve about 5 dB SNR gain on average and runs in real time. 





                      AVV      EVV      EVN
Number of tests       630      560      480
Mean SNR gain (dB)    15.31    5.09     4.41
Std SNR gain (dB)     5.69     3.34     2.87
Max SNR gain (dB)     25.65    15.18    14.61
Min SNR gain (dB)     -0.21    -0.42    -0.50

TABLE 1 



Summary results for all three testing groups (anechoic, echoic voice vs. voice, and echoic 
voice vs. noise) are shown in Table 1. In the table, the following designations are 
employed: AVV = Anechoic Voice vs. Voice; EVV = Echoic Voice vs. Voice; and EVN 
= Echoic Voice vs. Noise. 

We have presented a real-time version of the DUET algorithm 
that uses gradient descent to learn the anechoic mixing parameters and then demixes by 
partitioning the time-frequency representations of the mixtures. We have also introduced 
a measure of W-disjoint orthogonality and provided empirical evidence for the 
approximate W-disjoint orthogonality of speech signals. 

APPENDIX A 

Appendix A describes the justification for the W-disjoint orthogonality of speech 
assumption employed herein, according to an illustrative embodiment of the present 
invention. Clearly, the W-disjoint orthogonality assumption is not exactly satisfied for 
our signals of interest. We introduce here a measure of W-disjoint orthogonality for a 
group of sources and show that speech signals are indeed nearly W-disjoint orthogonal to 
each other. Consider the time-frequency mask, 



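$$\Phi_x(\omega,\tau) = \begin{cases} 1 & 20\log\left(|S_1(\omega,\tau)| / |S_2(\omega,\tau)|\right) > x \\ 0 & \text{otherwise} \end{cases} \tag{28}$$

and the resulting energy ratio, 

$$r(x) = \frac{\left\|\Phi_x(\omega,\tau)\,S_1(\omega,\tau)\right\|^2}{\left\|S_1(\omega,\tau)\right\|^2} \tag{29}$$

which measures the percentage of energy of source 1 for time-frequency points where it 
dominates source 2 by x dB. We propose r(x) as a measure of W-disjoint orthogonality. 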
For example, FIG. 12 shows r(x) averaged for pairs of sources used in the demixing tests. 
We can see from the graph that r(3) > 0.9 for all three, and thus say that the signals used in 
the tests were 90% W-disjoint orthogonal at 3 dB. If we can correctly map 
time-frequency points with 3 dB or more single source dominance to the correct corresponding 
output partition, we can recover 90% of the energy of the original sources. FIG. 12 
also demonstrates the W-disjoint orthogonality of six speech signals taken as a group and 
the fact that independent Gaussian white noise processes are less than 50% W-disjoint 
orthogonal at all levels. 
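
The measure r(x) can be sketched for two sources as follows. This is a minimal illustration, assuming r(x) is the energy of source 1 retained by the mask divided by the total energy of source 1, and treating points where source 2 is exactly zero as dominated by source 1. The function name and the sample values are made up.

```python
import math


def disjoint_orthogonality(s1, s2, x_db):
    """Fraction of the energy of s1 at points where s1 exceeds s2 by more than x_db dB.

    s1, s2: equal-length sequences of complex time-frequency values.
    """
    kept = 0.0
    total = 0.0
    for u, v in zip(s1, s2):
        energy = abs(u) ** 2
        total += energy
        if abs(u) == 0.0:
            continue  # no energy to keep at this point
        # A zero-valued competitor counts as dominated (assumption of this sketch).
        if abs(v) == 0.0 or 20.0 * math.log10(abs(u) / abs(v)) > x_db:
            kept += energy
    return kept / total if total > 0.0 else 0.0


# Made-up values: source 1 dominates by about 6 dB, 0 dB, and -20 dB.
s1 = [2.0 + 0j, 1.0 + 0j, 0.1j]
s2 = [1.0 + 0j, 1.0j, 1.0 + 0j]
r3 = disjoint_orthogonality(s1, s2, 3.0)
r_low = disjoint_orthogonality(s1, s2, -30.0)
print(r3, r_low)
```

With a threshold of 3 dB only the first point is kept, so most but not all of the energy of source 1 is retained. With a threshold of -30 dB every point passes the mask.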

APPENDIX B 

Appendix B describes the ML Estimation for the DUET Model employed herein, 
according to an illustrative embodiment of the present invention. Assume a mixing 
model of type (1) - (2) to which we add measurement noise: 

$$X_1(\omega,\tau) = \sum_{j=1}^{N} q_j(\omega,\tau)\,S_j(\omega,\tau) + \nu_1(\omega,\tau) \tag{30}$$

$$X_2(\omega,\tau) = \sum_{j=1}^{N} q_j(\omega,\tau)\,a_j e^{-i\omega\delta_j}\,S_j(\omega,\tau) + \nu_2(\omega,\tau) \tag{31}$$

The ideal model (1)-(2) is obtained in the limit $\nu_1, \nu_2 \to 0$. In practice, we make the 
computations assuming the existence of such a noise, and then we pass to the limit. We 
assume the noise and source signals are Gaussian distributed and independent from one 
another, with zero mean and known variances: 

$$\begin{bmatrix} \nu_1(\omega,\tau) \\ \nu_2(\omega,\tau) \end{bmatrix} \sim N(0, \sigma^2 I_2)$$

$$S_j(\omega,\tau) \sim N(0, p_j(\omega))$$

The Bernoulli random variables $q_j(\omega,\tau)$ are NOT independent. To accommodate the 
W-disjoint orthogonality assumption, we require that for each $(\omega,\tau)$ at most one of the 
$q_j(\omega,\tau)$ can be unity, and all others must be zero. Thus the N-tuple 
$(q_1(\omega,\tau), \ldots, q_N(\omega,\tau))$ takes values only in the set 

$$Q = \{(0,0,\ldots,0), (1,0,\ldots,0), \ldots, (0,0,\ldots,1)\}$$

of cardinality N + 1. We assume uniform priors for these random variables. 

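The short-time stationarity implies different frequencies are decorrelated (and 
hence independent) from one another. We use this property in constructing the 
likelihood. The likelihood of parameters $(a_1, \delta_1, \ldots, a_N, \delta_N)$ given the data 
$(X_1(\omega,\tau), X_2(\omega,\tau))$ and spectral powers $\sigma^2$, $p_j(\omega)$ at a given $\tau$ is obtained by 
conditioning with respect to the $q_j(\omega,\tau)$: 

$$L(a_1,\delta_1,\ldots,a_N,\delta_N;\tau) := p(X_1(\cdot), X_2(\cdot) \mid a_1,\delta_1,\ldots,a_N,\delta_N;\tau,\sigma^2,p_j) = \prod_{\omega}\sum_{j=0}^{N} \frac{\exp(-M)}{\pi^2\det\left(\sigma^2 I_2 + p_j(\omega)\Gamma_j(\omega)\right)}\,p(q_j(\omega,\tau) = 1) \tag{32}$$

where: 

$$M = \begin{bmatrix} \overline{X_1(\omega,\tau)} & \overline{X_2(\omega,\tau)} \end{bmatrix}\left(\sigma^2 I_2 + p_j(\omega)\Gamma_j(\omega)\right)^{-1}\begin{bmatrix} X_1(\omega,\tau) \\ X_2(\omega,\tau) \end{bmatrix}$$

and 

$$\Gamma_j(\omega) = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix}\begin{bmatrix} 1 & a_j e^{i\omega\delta_j} \end{bmatrix} = \begin{bmatrix} 1 & a_j e^{i\omega\delta_j} \\ a_j e^{-i\omega\delta_j} & a_j^2 \end{bmatrix}$$

and we have defined $q_0(\omega,\tau) = 1 - \sum_{k=1}^{N} q_k(\omega,\tau)$, $p_0(\omega) = 0$, and $\Gamma_0(\omega) = I_2$ for 
notational simplicity in (32) in dealing with the case when no source is active at a given 
$(\omega,\tau)$. 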

Next, the Matrix Inversion Lemma (or an explicit computation) gives: 

$$M = \frac{1}{\sigma^2}\,\frac{1}{\sigma^2 + p_j(\omega)(1 + a_j^2)}\left(p_j(\omega)\left|a_j e^{-i\omega\delta_j}X_1(\omega,\tau) - X_2(\omega,\tau)\right|^2 + \sigma^2\left(|X_1(\omega,\tau)|^2 + |X_2(\omega,\tau)|^2\right)\right)$$

and 

$$\det\left(\sigma^2 I_2 + p_j(\omega)\Gamma_j(\omega)\right) = \sigma^2\left(\sigma^2 + p_j(\omega)(1 + a_j^2)\right)$$
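
The closed forms for the quadratic form and the determinant can be verified numerically against an explicit 2 x 2 inverse. The following is a minimal sketch with made-up values for the data, the mixing parameters, and the variances. The function names are hypothetical.

```python
import cmath


def direct_form(x1, x2, a, delta, omega, p, sigma2):
    """x^H (sigma^2 I + p Gamma)^{-1} x and the determinant via an explicit 2x2 inverse."""
    g = a * cmath.exp(-1j * omega * delta)  # second entry of [1, a e^{-i omega delta}]^T
    # C = sigma^2 I + p * [[1, conj(g)], [g, a^2]]
    c11 = sigma2 + p
    c12 = p * g.conjugate()
    c21 = p * g
    c22 = sigma2 + p * a * a
    det = c11 * c22 - c12 * c21
    # C^{-1} x with C^{-1} = (1 / det) * [[c22, -c12], [-c21, c11]]
    y1 = (c22 * x1 - c12 * x2) / det
    y2 = (-c21 * x1 + c11 * x2) / det
    m = x1.conjugate() * y1 + x2.conjugate() * y2
    return m.real, det.real


def closed_form(x1, x2, a, delta, omega, p, sigma2):
    """The same two quantities from the closed-form expressions."""
    g = a * cmath.exp(-1j * omega * delta)
    denom = sigma2 + p * (1.0 + a * a)
    energy = abs(x1) ** 2 + abs(x2) ** 2
    m = (p * abs(g * x1 - x2) ** 2 + sigma2 * energy) / (sigma2 * denom)
    return m, sigma2 * denom


# Made-up values for one time-frequency point.
args = dict(x1=0.7 - 0.2j, x2=-0.4 + 1.1j, a=0.9, delta=0.3, omega=1.5, p=2.0)
m_direct, det_direct = direct_form(sigma2=0.05, **args)
m_closed, det_closed = closed_form(sigma2=0.05, **args)

# As sigma^2 shrinks, sigma^2 * M approaches |a e^{-i omega delta} X1 - X2|^2 / (1 + a^2).
g = args["a"] * cmath.exp(-1j * args["omega"] * args["delta"])
rho_value = abs(g * args["x1"] - args["x2"]) ** 2 / (1.0 + args["a"] ** 2)
m_small, _ = closed_form(sigma2=1e-9, **args)
print(m_direct, m_closed, det_direct, det_closed, 1e-9 * m_small, rho_value)
```

The last comparison previews the small-noise limit: the scaled quadratic form tends to the quantity $\rho$ used throughout the derivation.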



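Now we pass to the limit $\sigma \to 0$. The dominant terms from the previous two equations 
are: 

$$M \approx \frac{1}{\sigma^2}\,\frac{\left|a_j e^{-i\omega\delta_j}X_1(\omega,\tau) - X_2(\omega,\tau)\right|^2}{1 + a_j^2}$$

and 

$$\det\left(\sigma^2 I_2 + p_j(\omega)\Gamma_j(\omega)\right) \approx \sigma^2\,p_j(\omega)(1 + a_j^2)$$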

Of the N + 1 terms in each sum of (32), only one term is dominant, namely the one of the 
largest exponent. Assume $n : \omega \mapsto \{0, 1, \ldots, N\}$ is the selection map defined by: 

$$n(\omega) = k \iff \rho(a_k,\delta_k,\omega,\tau) \le \rho(a_j,\delta_j,\omega,\tau)\ \ \forall j \ne k$$

where: 

$$\rho(a_0,\delta_0,\omega,\tau) = |X_1(\omega,\tau)|^2 + |X_2(\omega,\tau)|^2$$

and for $k \in \{1, 2, \ldots, N\}$: 

$$\rho(a_k,\delta_k,\omega,\tau) = \frac{\left|a_k e^{-i\omega\delta_k}X_1(\omega,\tau) - X_2(\omega,\tau)\right|^2}{1 + a_k^2}$$



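Then the likelihood becomes: 

[Equation (33) is not fully legible in the source text. From the dominant terms above, it 
has the form of a prefactor multiplying $\exp\left(-\frac{1}{\sigma^2}\sum_{\omega}\rho(a_{n(\omega)},\delta_{n(\omega)},\omega,\tau)\right)$, with M the 
number of frequencies; the prefactor involves $\sigma^2$ for $k = 0$ and $p_k(\omega)(1 + a_k^2)$ for 
$k \in \{1, 2, \ldots, N\}$.] 

The dominant term in log-likelihood remains the exponent. Thus: 

$$\log L \approx -\frac{1}{\sigma^2}\sum_{\omega}\rho(a_{n(\omega)},\delta_{n(\omega)},\omega,\tau) \tag{34}$$

and maximizing the log-likelihood is equivalent to the following (which is (12)): 

$$\min_{a_1,\delta_1,\ldots,a_N,\delta_N}\ \sum_{\omega}\min\left(\rho(a_1,\delta_1,\omega,\tau), \ldots, \rho(a_N,\delta_N,\omega,\tau)\right)$$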
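
The step from a sum of exponentials to a minimum relies on the sum being dominated by its largest-exponent term as $\sigma \to 0$. A minimal numerical sketch of that limit follows. It omits the determinant prefactors for simplicity, which is an assumption of this illustration, and the $\rho$ values are made up.

```python
import math


def scaled_log_sum(rhos, sigma2):
    """-sigma^2 * log(sum_k exp(-rho_k / sigma^2)), computed stably."""
    m = min(rhos)
    # Factor out the largest exponential so the remaining sum is at least 1.
    s = sum(math.exp(-(r - m) / sigma2) for r in rhos)
    return m - sigma2 * math.log(s)


rhos = [0.7, 0.2, 1.5]  # made-up values of rho at one frequency
values = [scaled_log_sum(rhos, s2) for s2 in (1.0, 0.1, 0.01, 0.001)]
print(values, min(rhos))
```

For these values the scaled log of the sum stays at or below the smallest $\rho$ and approaches it as the noise variance shrinks, which is why only the minimizing term survives in the limit.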

Although the illustrative embodiments have been described herein with reference 
to the accompanying drawings, it is to be understood that the present invention is not 
limited to those precise embodiments, and that various other changes and modifications 
may be effected therein by one of ordinary skill in the related art without departing from 
the scope or spirit of the invention. All such changes and modifications are intended to 
be included within the scope of the invention as defined by the appended claims. 